# Curation Bubbles

This repository contains the replication files for:

> Green, Jon, Stefan McCabe, Sarah Shugars, Hanyu Chwe, Luke Horgan, Shuyang Cao and David Lazer. Conditionally accepted, _American Political Science Review_. “Curation Bubbles.” https://osf.io/udfaz/

The analysis combines many data sources and has many moving parts, so this README provides a quick overview.

In the paper, we draw on two primary data sources: a dataset of voter-file-linked Twitter data, and Meta's [URL Shares Dataset](https://socialscience.one/rfps). Each of these has significant data-access limitations. The raw Tweets and voter data cannot be shared due to Twitter's Terms of Use and our data-use agreement with the voter file vendor. We provide code for constructing the URL- and domain-level aggregate data analyzed in this paper. The URL Shares Dataset must be accessed through Meta's FORT clean room environment.

## Generating Data Files
From the FORT environment, run `bash make_fb_data.sh` to execute the relevant SQL queries. On a system with Spark and access to the raw voter-file-linked Tweets, run `bash make_tw_data.sh` to execute the relevant SQL queries.

## Generating Figures

From the FORT environment, run `bash make_fb_plots.sh` to generate Figures 2b, 4b, 5 and 8. This also generates relevant supplementary plots.

From an environment with access to the aggregate Twitter data, run `bash make_tw_plots.sh` to generate Figures 2a, 4a, 6, and 7. This also generates relevant supplementary plots. If the environment has access to the restricted individual-level voter data, the script will also generate Figure 3.

To generate the rest of the supplementary figures (Figure A.1 and Figures G.1-G.3), run `bash make_other_plots.sh`.


## Manifest of Files Included in Dataverse

### Included Data Files
* `countypres_2000-2020.tab`: County-level election returns, via [MIT Election Lab](https://electionlab.mit.edu/data).
* `domain_reference_table.tsv`: Summaries of domain-level audience scores, sourced from the Twitter data.
* `url_reference_table.tsv`: Summaries of URL-level audience scores, sourced from the Twitter data.
* `qualtrics_hand_coding_scrubbed.sav`: SPSS/Qualtrics file containing results of hand-coding exercise (without names or unique identifiers); see Appendix .
* `robertson_bakshy_scores.tab`: Earlier domain-level partisanship measures, via [Ronald Robertson](https://github.com/gitronald/domains).
* `sampled_headlines.tab`: The URLs sampled for inclusion in the hand-coding exercise.

### Scripts
* `fb_bubble_plots.R`: Generate Figure 8.
* `fb_complete_preprocessing.py`: Join in auxiliary datasets (political classifier) to FB data.
* `fb_construct_intervals.R`: Construct analytic and bootstrapped intervals for FB data.
* `fb_dist_plots.R`: Generate Figures 2b and 5.
* `fb_run_collection_queries.py`: Run SQL queries to generate FB data from raw (Social Science One).
* `fb_selected_domains_plots.py`: Generate Figure 4b.
* `make_fb_plots.sh`: One-button replication script for Facebook plots.
* `make_other_plots.sh`: One-button replication script for supplementary (non-FB, non-TW) plots.
* `make_tw_plots.sh`: One-button replication script for Twitter plots.
* `stylized_example.py`: Generate Figure A.1.
* `tw_bubble_plots.R`: Generate Figure 7.
* `tw_compare_old_methods.R`: Generate Figures D.1 and D.2.
* `tw_dist_plots.R`: Generate Figure 2a.
* `tw_generate_scores.R`: Run SQL queries to generate TW data from raw.
* `tw_hand_coding_plot.R`: Generate Figure 6.
* `tw_preprocess.R`: Construct analytic and bootstrapped intervals for TW data.
* `tw_selected_domains_plots.R`: Generate Figure 4a.
* `tw_user_plots.R`: Generate Figure 3.
* `voter_file_validation_plots.R`: Generate Figures G.1-G.3.

## Codebook of Reference Tables
### URL table
- `url`: the URL
- `domain`: the URL's parent domain
- `headline`: if a headline was found, the headline
- `blurb`: if a blurb was found, the blurb
- `politics_score`: the modeled likelihood that `blurb` contained political content
- `politics_label`: binary indicator for `politics_score` > 0.9.
- `date`: the date the URL was first shared
- `url_score_orig`: the (Twitter-based) URL-level audience score, using trichotomized modeled partisanship
- `url_score_continuous`: the (Twitter-based) URL-level audience score, using continuous modeled partisanship
- `url_score_reg`: the (Twitter-based) URL-level audience score, using voter registration (D/R only)
- `url_score_reg_ind`: the (Twitter-based) URL-level audience score, using voter registration (allowing Independents to contribute to the denominator)
- `num_shares`: the total number of times the URL was shared
- `num_dem_shares`: the total number of times the URL was shared by a Democrat
- `num_ind_shares`: the total number of times the URL was shared by an Independent
- `num_rep_shares`: the total number of times the URL was shared by a Republican
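
As a quick sanity check on the URL table, the relationship between `politics_score` and `politics_label` described above can be re-derived with the Python standard library. This is a minimal sketch, not part of the replication scripts: the helper names (`read_url_table`, `derive_politics_label`) and the sample rows are hypothetical, and the encoding of `politics_label` in the shipped file (for example `0`/`1` versus `TRUE`/`FALSE`) and the handling of missing scores are not specified here, so check them against `url_reference_table.tsv` before comparing.

```python
import csv
import io

POLITICS_THRESHOLD = 0.9  # cutoff stated in the codebook for `politics_label`


def read_url_table(handle):
    """Parse a tab-separated URL table into a list of dicts (one per URL)."""
    return list(csv.DictReader(handle, delimiter="\t"))


def derive_politics_label(row, threshold=POLITICS_THRESHOLD):
    """Return 1 if the row's politics_score strictly exceeds the threshold, else 0."""
    return int(float(row["politics_score"]) > threshold)


# Made-up sample standing in for url_reference_table.tsv (assumption: same column names).
sample = io.StringIO(
    "url\tdomain\tpolitics_score\tnum_shares\n"
    "https://example.com/a\texample.com\t0.97\t120\n"
    "https://example.com/b\texample.com\t0.42\t35\n"
)
rows = read_url_table(sample)
labels = [derive_politics_label(r) for r in rows]
print(labels)
```

To run the same check on the real file, pass an open handle (for example `open("url_reference_table.tsv", newline="", encoding="utf-8")`) to `read_url_table` in place of the in-memory sample.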

### Domain table
- `domain`: the domain name.
- `pct_political`: the proportion of URLs in this domain with at least 10 shares which were labelled as political.
- `num_shares`: the total number of shares.
- `num_dem_shares`: the number of shares among Democrats.
- `num_ind_shares`: the number of shares among Independents.
- `num_rep_shares`: the number of shares among Republicans.
- `domain_score_orig`: the (Twitter-based) domain-level audience score, using trichotomized modeled partisanship, without subsetting to political URLs.
- `domain_score_continuous`: the (Twitter-based) domain-level audience score, using continuous modeled partisanship, without subsetting to political URLs.
- `domain_score_reg`: the (Twitter-based) domain-level audience score, using voter registration (D/R only), without subsetting to political URLs.
- `domain_score_reg_ind`: the (Twitter-based) domain-level audience score, using voter registration (allowing Independents to contribute to the denominator), without subsetting to political URLs.
- `domain_score_orig`: the (Twitter-based) domain-level audience score, using trichotomized modeled partisanship, subsetting to political URLs.
- `domain_score_continuous`: the (Twitter-based) domain-level audience score, using continuous modeled partisanship, subsetting to political URLs. (This is what is used most often in the paper.)
- `domain_score_reg`: the (Twitter-based) domain-level audience score, using voter registration (D/R only), subsetting to political URLs.
- `domain_score_reg_ind`: the (Twitter-based) domain-level audience score, using voter registration (allowing Independents to contribute to the denominator), subsetting to political URLs.
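
The definition of `pct_political` above can be illustrated by aggregating URL-level rows up to the domain level. The sketch below is one reading of that definition, assuming the denominator is the set of a domain's URLs with at least 10 shares and the numerator is the subset of those labelled political. The function name `pct_political_by_domain` and the example rows are hypothetical, and the released values may differ if the underlying pipeline applies additional filters.

```python
from collections import defaultdict

MIN_SHARES = 10  # share cutoff stated in the codebook for `pct_political`


def pct_political_by_domain(url_rows, min_shares=MIN_SHARES):
    """Share of each domain's sufficiently shared URLs that are labelled political."""
    totals = defaultdict(int)     # URLs per domain meeting the share cutoff
    political = defaultdict(int)  # of those, URLs with politics_label == 1
    for row in url_rows:
        if int(row["num_shares"]) < min_shares:
            continue  # URLs below the cutoff do not count toward either total
        totals[row["domain"]] += 1
        political[row["domain"]] += int(row["politics_label"])
    return {domain: political[domain] / totals[domain] for domain in totals}


# Made-up rows using the URL table's column names.
example_rows = [
    {"domain": "example.com", "num_shares": 120, "politics_label": 1},
    {"domain": "example.com", "num_shares": 35, "politics_label": 0},
    {"domain": "example.com", "num_shares": 4, "politics_label": 1},  # below cutoff
    {"domain": "example.org", "num_shares": 10, "politics_label": 1},
]
result = pct_political_by_domain(example_rows)
print(result)
```

Domains with no URLs at or above the cutoff are simply absent from the result, which avoids a division by zero.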
